reduce GRPC transient failures #1186
Conversation
@jpeeler: This pull request references Bugzilla bug 1768819, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jpeeler. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
2 similar comments
/retest

/retest
3dbcda4 to 6d1bfe0 (Compare)
6d1bfe0 to 173351e (Compare)
	}

	if len(pods) > 0 {
		for _, cond := range pods[0].Status.Conditions {
Here we are getting an array of pods but only checking the condition of the pod at index 0, i.e. the first one. Can we always expect to get exactly one pod in the pods list, or if the first one is ready, must the others be as well? If we only look at the one, can we use a Get instead of a List when making the request on L635?
Hrm, good point. I'm pretty sure the mapping is always 1:1...
Turns out that Get can only be queried by name, so I'm sticking with List here.
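For readers skimming the thread, here is a minimal self-contained sketch of the shape the check takes, assuming a client-go pod lister, a 1:1 source-to-pod mapping, and an illustrative label key (the helper name and label value are not OLM's exact API):

```go
package catalogcheck

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// catalogSourceLabelKey stands in for reconciler.CatalogSourceLabelKey; the
// value used here is an assumption.
const catalogSourceLabelKey = "olm.catalogSource"

// backingPodReady lists the pods labeled for the catalog source (Get is not
// usable because it only accepts a name) and reports whether the first pod
// has reached the ContainersReady condition.
func backingPodReady(podLister listersv1.PodLister, namespace, sourceName string) (bool, error) {
	selector := labels.SelectorFromSet(map[string]string{catalogSourceLabelKey: sourceName})
	pods, err := podLister.Pods(namespace).List(selector)
	if err != nil || len(pods) == 0 {
		return false, err
	}
	for _, cond := range pods[0].Status.Conditions {
		if cond.Type == corev1.ContainersReady && cond.Status == corev1.ConditionTrue {
			return true, nil
		}
	}
	return false, nil
}
```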
	return
}

logger.Infof("backing pod not yet ready")
maybe we could add "pod status %s", pods[0].Status.Message to this log message?
Maybe a good compromise is to indicate the pod name here? The message seems to be blank in my testing and printing the whole status block is going to be too much.
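Something along these lines, perhaps (just a sketch of the log call; the field name is illustrative):

```go
// Include the backing pod's name rather than its (often empty) status message.
logger.WithField("pod", pods[0].GetName()).Info("backing pod not yet ready")
```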
Interesting how this only works for image-based catalog sources but not those based off config maps. The only real difference for the configmap-based ones is that we have to seed the DB manually from the configmap before serving gRPC connections. I wonder if the gRPC health probe is trying to query the pod before that action is complete, and after a series of failures it starts to back off and act funny. I know we were looking at something like this in the BZ. Maybe we can test the configmap case with a delay in the readiness probe (until after the gRPC connection is ready) and see if that improves behavior?
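For anyone who wants to try that experiment, a rough sketch of a delayed readiness probe follows. The probe command, port, and delay values are placeholders rather than the actual registry pod spec, and the field names follow the k8s.io/api release contemporaneous with this PR:

```go
package registrysketch

import corev1 "k8s.io/api/core/v1"

// delayedReadinessProbe builds an exec readiness probe that waits before the
// first gRPC health check, giving a configmap-backed registry time to seed
// its database before probing begins.
func delayedReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{"grpc_health_probe", "-addr=:50051"},
			},
		},
		InitialDelaySeconds: 10,
		PeriodSeconds:       10,
		TimeoutSeconds:      5,
	}
}
```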
Thanks for doing this @jpeeler! Looking nice.
For now, I just have some thoughts on implementation:
	return
}

func (o *Operator) waitForBackingPod(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {
The name makes me think that this method is blocking, when it's not.
I changed it to checkBackingPodStatus.
chain := []CatalogSourceSyncFunc{
	o.syncConfigMap,
	o.syncRegistryServer,
	o.waitForBackingPod,
Won't this also be executed for "address" type CatalogSources (i.e. no pods to check readiness for)?
I honestly totally forgot about catalog sources of that type. I ended up adding a simple early return for that scenario.
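Roughly the shape of that early return, sketched as a standalone helper (the helper name is made up and the v1alpha1 import path may differ by OLM version):

```go
package catalogcheck

import "github.com/operator-framework/operator-lifecycle-manager/pkg/api/apis/operators/v1alpha1"

// hasBackingPod reports whether a CatalogSource is expected to have a
// registry pod at all. Address-only grpc sources point at an external
// server, so there is no pod readiness to wait for.
func hasBackingPod(in *v1alpha1.CatalogSource) bool {
	if in.Spec.SourceType == v1alpha1.SourceTypeGrpc && in.Spec.Image == "" && in.Spec.Address != "" {
		return false
	}
	return true
}
```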
	return
}

func (o *Operator) waitForBackingPod(logger *logrus.Entry, in *v1alpha1.CatalogSource) (out *v1alpha1.CatalogSource, continueSync bool, syncError error) {
Can this be a member of the registry reconciler?
It could be, but I'm not sure it should be. With the way the code is structured and higher level operations (which I consider pod checking to be) present in the main operator code, it seems best to me where it is. There is a comment to add memoization for ReconcileForSource, which if done later on would make the requested refactoring a bit less invasive.
for _, cond := range pods[0].Status.Conditions {
	if cond.Type == corev1.ContainersReady && cond.Status == corev1.ConditionTrue {
		continueSync = true
		return
	}
}
Can this be a member of the pod decorator used by the registry reconciler?
To me, this might be more appropriate if the shared code (in this case pod status checking) weren't used by the majority of catalog source types (configmap, grpc). The recently added if statement to ignore grpc address types isn't too ugly; do you agree?
crc := newCRClient(t)
selector := labels.SelectorFromSet(map[string]string{reconciler.CatalogSourceLabelKey: sourceName})
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(metav1.ListOptions{LabelSelector: selector.String()})
Nice! You can also use a field selector on the name instead of using a label.
Since awaitPods also uses a label selector, I'll stick with a label selector for consistency.
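For reference, the field-selector variant mentioned above would look roughly like this, reusing crc, testNamespace, and sourceName from the test and assuming k8s.io/apimachinery/pkg/fields is imported (a sketch, not what the test ended up using):

```go
// Watch a single CatalogSource by name instead of by label.
opts := metav1.ListOptions{
	FieldSelector: fields.OneTermEqualSelector("metadata.name", sourceName).String(),
}
catSrcWatcher, err := crc.OperatorsV1alpha1().CatalogSources(testNamespace).Watch(opts)
```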
for {
	select {
	case <-ctx.Done():
		return
	case evt, ok := <-catSrcWatcher.ResultChan():
		if !ok {
			errExit <- errors.New("watch channel closed unexpectedly")
			return
		}
		if evt.Type == watch.Modified {
			catSrc, ok := evt.Object.(*v1alpha1.CatalogSource)
			if !ok {
				errExit <- errors.New("watch returned unexpected object type")
				return
			}
			t.Logf("connectionState=%v", catSrc.Status.GRPCConnectionState)
			if catSrc.Status.GRPCConnectionState != nil {
				if catSrc.Status.GRPCConnectionState.LastObservedState == connectivity.Ready.String() {
					done <- struct{}{}
				}
				require.NotEqual(t, connectivity.TransientFailure.String(), catSrc.Status.GRPCConnectionState.LastObservedState)
			}
		}
	}
}
Since the watch was started earlier, I think this can be merged with the for {...} at the bottom of this test.
I don't want to miss any state changes, so I really want to start monitoring the watch before the catalog source is created. Since the result chan is unbuffered, I think that's the only safe way to do it, right?
  }
  op.sources = grpc.NewSourceStore(logger, 10*time.Second, 10*time.Minute, op.syncSourceState)
- op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now)
+ op.reconciler = reconciler.NewRegistryReconcilerFactory(lister, opClient, configmapRegistryImage, op.now, op.logger.IsLevelEnabled(logrus.DebugLevel))
I would lean towards just passing in a logrus.Entry instance instead of a debug boolean
Is the reason so that you can have access to the operator logger in places where that isn't currently possible?
I've done so now - at least to the level of generating the pod spec.
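For context, the kind of thing that ends up gated on the logger at pod-spec generation time might look like the sketch below. The env var names are the standard grpc-go logging knobs, but whether this PR sets exactly these, plus the helper's name and signature, are assumptions:

```go
package registrysketch

import (
	"github.com/sirupsen/logrus"
	corev1 "k8s.io/api/core/v1"
)

// withGRPCDebugEnv appends gRPC debug environment variables to every
// container in the pod spec, but only when the operator logger is running
// at debug level.
func withGRPCDebugEnv(pod *corev1.Pod, logger *logrus.Entry) {
	if logger == nil || !logger.Logger.IsLevelEnabled(logrus.DebugLevel) {
		return
	}
	debugEnv := []corev1.EnvVar{
		{Name: "GRPC_GO_LOG_VERBOSITY_LEVEL", Value: "99"},
		{Name: "GRPC_GO_LOG_SEVERITY_LEVEL", Value: "info"},
	}
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env, debugEnv...)
	}
}
```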
173351e to c9ec546 (Compare)
c9ec546 to 1c58f8f (Compare)
1c58f8f to 420274c (Compare)
/retest
	}
	return
}
if err := o.catsrcQueueSet.Requeue(state.Key.Namespace, state.Key.Name); err != nil {
isn't this just requeuing after every state change? is that desired?
I do this deliberately so that the status stays in sync with what the grpc connection is reporting. This only adds additional requeues when the connection is reported as ready, since requeues were already being done otherwise. I could refactor, since a lot of the above is similar, if you agree.
I went ahead and refactored it.
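The refactor presumably collapses the per-state branches into a single unconditional requeue, along the lines of the sketch below (the method body and log message are illustrative, not the merged code):

```go
// syncSourceState requeues the owning CatalogSource for every observed
// connection state change, so the reported status is rewritten from the
// live gRPC connection rather than only on failure transitions.
func (o *Operator) syncSourceState(state grpc.SourceState) {
	if err := o.catsrcQueueSet.Requeue(state.Key.Namespace, state.Key.Name); err != nil {
		o.logger.WithError(err).Info("failed to requeue catalogsource after connection state change")
	}
}
```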
out = in.DeepCopy()

selector := map[string]string{reconciler.CatalogSourceLabelKey: in.GetName()}
shouldn't this use the pod decorator to get this selector?
I didn't see a way to have access to the decorator at this level, so I refactored to put some commonality into the reconciler package and use that instead.
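A sketch of the kind of commonality that could live in the reconciler package; only the exported CatalogSourceLabelKey name comes from the diff above, and the label value and helper name shown here are assumptions:

```go
package reconciler

import "k8s.io/apimachinery/pkg/labels"

// CatalogSourceLabelKey ties registry pods back to their CatalogSource
// (the value shown here is illustrative).
const CatalogSourceLabelKey = "olm.catalogSource"

// CatalogSourceSelector builds the pod selector for the named CatalogSource,
// so the operator sync code and the reconcilers construct it identically.
func CatalogSourceSelector(name string) labels.Selector {
	return labels.SelectorFromSet(map[string]string{CatalogSourceLabelKey: name})
}
```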
aa7152e to b4e2c24 (Compare)
b4e2c24 to 608dfac (Compare)
608dfac to a5ab960 (Compare)
/test unit
This is an attempt to eradicate all the transient failures that GRPC connections occasionally report. These "failures" usually resolve after about 20 seconds or so, but seeing the word failure is really unsettling to users. The fix here is to confirm that a catalog source pod readiness check has been completed before attempting to set up a GRPC connection.
The flags weren't being parsed correctly and were causing a crash, so just remove this unnecessary code.
Also adds some labeling on some test pods, but the new code specifically ignores them because they are for grpc address pods.
Allow the launched catalog pods to be aware of whether or not the log level is set to debug and add additional GRPC logging if so. In the future, other debugging can be added without the somewhat large number of changes seen here.
a5ab960 to 6fe3988 (Compare)
/retest
1 similar comment
/retest

/retest

/retest
@jpeeler: No Bugzilla bug is referenced in the title of this pull request. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Description of the change:
This adds an additional function to the syncCatalogSource sync chain to ensure the GRPC readiness probes are reporting ready before attempting any GRPC operations.
Motivation for the change:
The idea is to reduce transient errors as reported on the catalog source status. Unfortunately, I've only been able to reliably improve the catalog sources that are of type grpc and not the ones that are backed by configmaps (which is what the reported bug is specifically about).
There's also some work done to make debugging easier in the future.
Reviewer Checklist
/docs